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Abstract 

Statistical model checking techniques have been shown to be effective for approximate model checking on large 
stochastic systems, where explicit representation of the state space is impractical. Importantly, these techniques 
ensure the validity of results with statistical guarantees on errors. There is an increasing interest in these classes of 
algorithms in computational systems biology since analysis using traditional model checking techniques does not 
scale well. In this context, we present two improvements to existing statistical model checking algorithms. Firstly, 
we construct an algorithm which removes the need of the user to define the indifference region, a critical 
parameter in previous sequential hypothesis testing algorithms. Secondly, we extend the algorithm to account for 
the case when there may be a limit on the computational resources that can be spent on verifying a property; i.e, 
if the original algorithm is not able to make a decision even after consuming the available amount of resources, 
we resort to a p-value based approach to make a decision. We demonstrate the improvements achieved by our 
algorithms in comparison to current algorithms first with a straightforward yet representative example, followed by 
a real biological model on cell fate of gustatory neurons with microRNAs. 



Introduction 

Model checking is an automated method to formally verify 
a system's behavior. It is a technique widely used to vali- 
date logic circuits, communication protocols and software 
drivers [1]. Usually, the system to be analyzed is encoded 
in a specification language suitable for automated explora- 
tion, and the properties to be verified are specified as for- 
mulas in temporal logics. Given a model of the system and 
a temporal logic formula, the model checker systematically 
explores the state space of the model to check if the speci- 
fied property is satisfied. If the property holds, the model 
checker returns the value true; otherwise, the model 
checker returns a false value, with a counter example of a 
specific trace of the system where the property failed. 

Recently there have been efforts to apply model check- 
ing in computational systems biology [2-7]. In this context, 
probabilistic models - such as Discrete Time Markov 
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Chain (DTMC) or Continuous Time Markov Chain 
(CTMC) - are often used and, properties are expressed 
with specialized probabilistic temporal logics that quantify 
the properties with probability. We refer to this as 
probabilistic model checking. 

Usually probabilistic model checking is solved using 
numerical solution techniques, and typically involves itera- 
tively computing the exact probability of paths satisfying 
appropriate sub-formulas. There are several efficient opti- 
mizations to represent the state space of these models 
compactly, and to traverse the state space efficiently. How- 
ever, they are usually very memory intensive and do not 
scale well to large stochastic models. Hence, approximate 
methods for solving such problems are often used. One 
such class of methods, known as statistical model checking, 
relies on using, as the name suggests, statistical techniques 
to perform model checking. It is based on simulating a 
number of sample runs of the system and, subsequently, 
deciding whether the samples provide enough evidence to 
suggest the validity or invalidity of the property specified 
as a probabilistic temporal logic formula [8] . 



o 
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Statistical model checking is based on the crucial 
observation that it may not be necessary to obtain an 
absolute accurate estimate of a probability in order to 
verify probabilistic properties. For example, to verify if 
the probability of a random variable exhibiting a certain 
behavior is greater than 9, it is not necessary to com- 
pute the exact probability of the property (p) to hold; 
instead, it is enough if we infer, by sufficiently sampling 
the underlying model, that the probability is safely 
above or below 9. Approaches based on statistical model 
checking are proven to be scalable, since they are not 
dependent on constructing and traversing the full state 
space of the model. Additionally, they have a low time 
complexity, require low memory, and are tunable to the 
desired accuracy. These factors make them ideal for per- 
forming analysis on large complex stochastic systems. 

Since computational pathway models are typically 
large complex stochastic models, our focus in this paper 
is on the statistical model checking problem. A standard 
version of statistical model checking, which is the one 
we focus on in this paper, is called sequential hypothesis 
testing [9,10]. The success of this approach depends lar- 
gely on a user-defined parameter called the indifference 
region. The choice of the indifference region dictates 
the number of samples necessary to verify the property 
and the outcome of the verification task. Consequently, 
it will be helpful to have a method of specifying the 
indifference region that does not solely depend on the 
user-input. 

Furthermore, when the true probability of the property 
is very close to the probability specified in the formulas, a 
large number of simulations is needed to validate or invali- 
date the property. Maintaining an optimal balance 
between computational effort and precision is important. 
It may well be the case with existing algorithms that, to 
satisfy the specified error bounds, a large number of sam- 
ples are drawn. In such cases, it will be useful to return a 
reasonable answer once a pre-specified amount of compu- 
tational resources have been consumed while the statistical 
test required is unable to make a decision yet. 

To address these issues, we propose optimized 
sequential hypothesis testing algorithms which 1) do not 
need the user to provide the indifference region para- 
meter; 2) adjusts to the difficulty of the problem, i.e. the 
distance between p (the true probability) and 6 (prob- 
ability dictated by the property) dynamically; 3) always 
provides a definite true or false result, i.e, does not 
return the undesirable undecided result (or "I do not 
know" response). 

Related work 

Existing works on statistical model checking can be clas- 
sified based on whether the probabilistic system is a 



black-box or a white-box system. A white-box system 
allows generation of as many trajectories of the system 
as desired. In a black-box system, only a fixed number 
of trajectories is available and, using which a decision 
has to be made. We establish the basic concepts and 
terminologies to be used in the rest of the paper in this 
section. Formally, probabilistic model checking refers to 
the problem of verifying if M = Pr M {y/}, A e {<, >, >, <}; 
i.e, given a probabilistic model M, and a property y/ 
encoded in a probabilistic temporal logic formalism, 
check whether yj holds in M with probability dictated by 
A w.r.t to 9. 

Black-box systems 

Statistical model checking on black-box systems is based 
on calculating a p-value that quantifies the statistical evi- 
dence of satisfaction or rejection of a hypothesis using 
the set of samples given [11,12]. Sen et al gives an algo- 
rithm for black box systems which quantifies the evi- 
dence of satisfaction of the formula by a p-value [11]. 
The problem is formulated as solving two separate 
hypothesis tests {HO: p <9 against HI: p > 9). If £ x/n > 9 
(where x t is 1 if the ith sample satisfies yt and 0 if it does 
not), HO is rejected, the formula is declared to hold, and 
the p-value is calculated. If the test does not reject HO 
then a second experiment is conducted, with HO: p > 9 
against HI: p < G. If £ x/n < 9, HO is rejected, the for- 
mula is declared false, and the corresponding p-value is 
calculated. The smaller the p-value, the greater is the 
confidence in the decision. 

Younes also discusses an algorithm for black box sys- 
tems using a modified version of single-sampling plan 
with p-value [13]. Younes proposes the single-sampling- 
based hypothesis testing algorithm where the number of 
samples n is decided upfront. The model checking pro- 
blem is formulated as a hypothesis test with the null 
hypothesis HO: p > 9 against the alternate hypothesis HI: p 
< 9, a constant c is also specified that decides the number 
of samples that should evaluate to true to accept the null 
hypotheses. Let X t be a Bernoulli random variable with 
parameter p such that Pr[X t = 1] = p and Pr[X t = 0] = 1 - 
p. An observation/sample of X h represented as x it states 
whether the specified temporal logic formula is true or 
false for a particular observation. For example, in this case, 
Xi is 1 if the ith sample satisfies yj and 0 if it does not. The 
strength of the hypothesis test is decided by parameters a 
and [3, which represent the probability of false negatives 
(Type-1 error) and false positives (Type-2 error) respec- 
tively. If Yli=i x < > c tnen tne hypothesis HO is accepted; 
else HI is accepted. The main challenge is to find the pair 
(n, c) which obey the error bounds {a, fi). Younes 
describes an algorithm based on binary search to find the 
pair {n, c) that obeys the bounds [13]. 
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White-box systems 

Model checking on white-box systems can be classified 
into those which are based on either statistical estima- 
tion or hypothesis testing. Statistical estimation based 
methods rely on getting an estimate of the true prob- 
ability, p, and comparing it with 8 (dictated by the tem- 
poral logic formula) to make a decision [14]. Algorithms 
based on hypothesis testing formulate the model check- 
ing problem into a standard hypothesis test between a 
null and alternate hypothesis. Using techniques devel- 
oped for solving hypothesis testing problems, a decision 
is made about the satisfiability of the property. Methods 
based on hypothesis testing can be further subdivided 
into two different approaches - those that rely on 
Frequentist statistical procedures [9,10]; and those that 
use Bayesian statistical procedures [15,16]. 

Bayesian methods have the advantages of smaller 
expected sample sizes and ability to incorporate prior 
information. However, Bayesian methods are generally 
more computationally expensive than their frequentist 
counterpart due to the requirement to produce a poster- 
ior distribution [17]. In Bayesian methods, the degree of 



confidence is indicated via a parameter called, Bayes fac- 
tor threshold, whereas frequentist methods use error 
bounds (Type-1 (a) and Type-2 (/}) error). To say one is 
better than the other would be going into the old debate 
between Frequentist and Bayesian statistics. However, 
we prefer the frequentist approach since it allow us to 
explicitly state the error bounds, which is more intuitive 
to us. 

Frequentist statistical model checking 

Younes and Simmons formulate the probabilistic model- 
checking problem as a sequential hypothesis-testing pro- 
blem. Here, we call their algorithm Younes A [9], which 
is as follows: To verify a formula of the form Pr> s {y/}, a 
hypothesis test is setup between a null hypothesis HO: 
p > 8 + S against the alternative hypothesis HI: p < 8 - 
S. The factor d represents the indifference region around 
the threshold 8. This is represented in Figure 1. Algo- 
rithms based on sequential hypothesis testing need 
input parameters a, fi, 8 which specify the Type-1, 
Type-2 error bounds and the indifference region respec- 
tively. These parameters help in controlling the number 
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Figure 1 Probability of accepting HO : p > 0 + S vs the actual probability of the formula holding (adapted from 
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of samples and guaranteeing the desired error rates. For 
a fixed value of a and P, 5 decides the number of sam- 
ples needed to verify a property. It is inversely propor- 
tional to the number of samples required; i.e., the 
smaller 5 is, the more samples are needed. Also, the 
smaller 5 is, the lesser is the probability of p being in 
the region [0-5, 9 + 5]. 5 is a user-defined parameter 
whose choice is problem specific and usually involves 
iterative tuning. Hence, deciding the optimal value of 5 
affects the practical applicability of these algorithms. A 
sequential sampling algorithm based on Wald's sequen- 
tial probability test is used to solve the hypothesis test- 
ing problem. After taking the n th sample from the 
model, the factor/, is calculated as, 



_ " Pr [X, = Xj\p = 6-8] _ [6 - -{9- 3]]("-S.i*0 

~ \ \ Pr[Xi = Xi\p = 6 + S] ~ [fl + si(n,*)[i_r 9 + i]](»-Si*0 



(1) 



Hypothesis HO is accepted if/„ < B; Hypothesis HI is 
accepted if f„ > A; and if B < f„ < A, another sample is 
drawn. The constants A and B are chosen so that they 
result in a test of strength {a, /?). In practice, to satisfy the 

1-/J 



strength dictated by {a, P) choose A : 



and 



B 



1 



-. The algorithm satisfies the error bounds (a, P) 



only when the true probability does not lie in the indiffer- 
ence region, which is an issue. 

To address this issue, Younes discusses a modified algo- 
rithm (Algorithm 1), which we call Younes B here, that 
bounds the error for cases when the true probability lies in 
the indifference region by introducing a factor (7), which 
allows and controls the probability of undecided results 
(when the true probability is within the indifference 
region) [10]. Younes B uses two acceptance sampling tests: 

HO : p > 6 against HI : p < 6 — S with strength (a, y) 
HO' : p > 6 + S against HI' : p < 6 with strength (y, P) 

Pr>e{yA is reported as true when both HO and HO' are 
accepted. It is reported as false when both HI and HI' are 
accepted. Otherwise, the results is reported as undecided. 
It is not meaningful to distinguish between Pr >e {y/} and 
Pf>ff{w}> an d P r >e{w) can essentially be written as Pr 2f) {y/}. 
Therefore, it is sufficient to present on this case, Pr>g{i//}. 

Algorithm 1 ORIGINAL YOUNES B (0, 5, a, P, 7) 
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n <- 0, d <- 0; 

Al <r- log 
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-P 
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1 



; {Accept HI if f n > Ax] 
{Accept HO iff n < Bi] 

; {Accept HI/ if f n > A 2 ] 

; {Accept HO/ if f' n < B 2 ] 



repeat 

n <— n + 1; {Generates a new sample} 
if (x„ = = 1) then 

d <— d +1; 
end if 

/„ <- dlog — — + (n-d) log — — - — ; 

Q I Q 

fL-^d log + (n — d) log ; 

until! ((Bi </„ < Ai) \\(B 2 <f n < A 2 )) 
if ((/„ <Bi)&&(/^ <B 2 ))then 

Return TRUE; [p > 6} 
else if ((f n > Ai) && (f' n > A 2 ))then 

Return FALSE; {p % 0} 
else 

Return UNDECIDED; 
end if 

x„ is the outcome of the «th sample, 1 if true else 0 
a is the Type I error 
P is the Type II error 

Optimized Statistical Model checking algorithm (OSM) 

As discussed earlier, we aim to remove the manual selec- 
tion of the indifference region parameter. The rationale 
behind this is because, while the parameter is critical to 
the success of previous sequential hypothesis testing algo- 
rithms, it is very difficult for the user to select a suitable 
value. We combine ideas from the realms of verifying 
white box and black box to produce an algorithm that is 
practically superior. The essence of our proposed algo- 
rithms is similar to Younes B's two-acceptance-sampling- 
tests approach but, we make several critical changes which 
enhance them significantly. We describe our algorithm in 
the following subsections. 

Adjusting 5 automatically 

Instead of having to specify a difficult-to-determine 
indifference region (explained in detail later), we first 
assume it to be 1:0, which is the largest possible value. 
We start with a large 5 because, the larger 5 is, the 
fewer samples we need. We then proceed with using 
two simultaneous acceptance-sampling tests just like 
[10]. However, the crucial difference is that, whenever 
an undecided result is returned by the algorithm, we 
reduce 5 by half and check whether 1) a definite result 
can be given, 2) another sample is needed, or 3) a 
further reduction is required. We continue this process 
until a definite result is produced. The details are given 
in Algorithm 2. 

Algorithm 2 OSM A (0, a, ft) 

5 <- 1, 7 <- m in (a, P), n <- 0, d <- 0; 

while (true) do 

(n, d, y) <— Incremental Younes B{0, 5, a, P, y, n, d); 
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if ((y = = TRUE) ||0' = = FALSE)) then 

Return y; 
else 

(5 <— <5/2; {Undecided with current 8, halve it} 
end if 
end while 

Functionlncremental Younes B(9, 8, a, [5, y, n, d); 
„ , , i-y r . t t i „ , 



Bj <- log-^— {Accept HO if/„ < Bi} 
1 — a 1 ' 





a 




Y 


1 


— a 


1 


-P 




Y 




P 


1 


- Y 



B 2 <- log y^j; {Accept HO/ if£ < B 2 } 
repeat 

n <— k + 1; {Generates a new sample} 
if (x„ = = 1) then 

d <— ^ +1; 
end if 

f n <- dlog — — + (n-d) log — — - — ; 

q ^ Q 

fL-^d log + (n — d) log -; 

ln & 0 + S y ' 8 l-(6>+<5) 

until! ((Bi </„ < Ai) \\(B 2 <f n < A 2 )) 
if ({fn <B 1 )&&(f n <B 2 )) then 

Return (n, d, TRUE); {p > 0} 
else if (if n > Ai) && (ft > A 2 )) then 

Return (n, d, FALSE); {p % 6} 
else 

Return (n, d, UNDECIDED); 
end if 

x„ is the outcome of the «th sample, 1 if true else 0 

a is the Type I error 

B is the Type I error 

n is the number of samples 

d is the number of samples satisfying y/ 
Our algorithm (OSM A) has three advantages over 
previous works. First, a predetermined user-defined 
indifference region 8 is not required. Secondly, the num- 
ber of samples required adjusts automatically to the dif- 
ficulty of the problem, i.e., depending on how close p is 
to 9, by starting with the largest possible indifference 
region. Finally, our algorithm always gives a definite 
result if sufficient samples are given and, that result is 
guaranteed to be error bounded. 

However, if p is very close to 8, the indifference region 
needs to be reduced to a very small value such that 8 < 
\p - 0\. If 8 is very small, the sample size required to 
determine a result will be very large or, in the worst 
case where p = 6, this algorithm will not terminate. 



Therefore, while such an algorithm is superior in theory, 
it may be limited in some situations in practice. Hence, 
in sub-section, we further improve this algorithm by set- 
ting a limit on the sample size. This will ensure that 
the program completes in a user-acceptable runtime to 
handle such unlikely but possible situations. 

The ability of OSM A to control errors is obviously 
dependent on Younes B algorithm's ability to control 
them. Therefore, interested readers are referred to [10] 
where they provide proofs for the strength of two accep- 
tance sampling tests. In this paper, we empirically 
demonstrate in section that OSM A consistently has the 
ability to control errors in various settings. 

Based on Algorithm 2, as OSM A repeatedly calls Incre- 
mental Younes B, it would require much more samples to 
be generated than Younes B. However, that is not true. 
This is because OSM A reuses samples from previous 
iterations (with a different 8) instead of starting from 
scratch with each call. Therefore, the number of samples 
needed to be generated by OSM A is actually the same as 
Younes B given the same 9, a, fi, y and 8. It is possible to 
reuse samples from different iterations because, given the 
same 8, a, j3, and y if the Younes B algorithm running at a 
larger value of 8 terminated at a value of n but returned 
UNDECIDED, then the Younes B algorithm running at a 
smaller value of 8 would not terminate and return TRUE/ 
FALSE at that same value of n (though it would terminate 
at a higher value of n and return a TRUE/FALSE/UNDE- 
CIDED answer) (Figure 2). 

Limiting the number of samples 

By limiting the sample size, we can bound the runtime 
of the program but we may not be able to bound the 
error rates. Therefore, we compute a p-value to serve as 
a measure of the confidence of the result. The modified 
algorithm is as follows. As before, we first assume 8 to 
be the largest possible, i.e., 1:0. Then we proceed using 
two simultaneous acceptance-sampling tests and, when- 
ever an undecided result is given, we reduce 8 by half 
and check whether 1) a definite result can be given, 2) 
another sample is needed, or 3) a further reduction is 
required. We continue this process until a definite result 
is given or when the sample size limit is reached. If a 
definite result is reached before the sample size limit, 
then the error rate is guaranteed to be bounded. Other- 
wise, if the sample size limit is reached, we compute the 
p-value for both hypotheses HO: p > 8 and HI: p < 6, 
and accept the hypothesis with the lower p-value. The 
p-values are computed using the method presented in 
[12] - viz., the p-value for HQ is 1 - F(d; n, 0) and the 
p-value for HI is F(d; n, 8), where d is the number of 
successes (or true), n is the total number of samples, 
and F(d; n, 8) is the Binomial cumulative distribution 
function, 
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Figure 2 Log 2 expected value of n at which the Younes B algorithm terminates and returns a TRUE/FALSE/UNDECIDED answer for 
different values of <5 (0.02, 0.05 and 0.1) at varying 6 with a = 0.01, /} = 0.01, y= 0.01 and p = 0.7 This figure demonstrates that with 
decreasing <5, the expected n increases. 



F(d;n,9) =J2[^ i jQ i (l-6) n ~ 1 ( 2 ) 

With this, we have developed an algorithm that 1) 
does not require the user to predetermine a suitable 
indifference region, 2) is guaranteed to bound specified 
Type-1 and Type-2 errors if sufficient samples can be 
generated, and 3) terminates and returns a confidence 
measure even in the rare event when p is extremely 
close to or equal to 6. We call the above algorithm 
OSM B. 

In the next section, we demonstrate the superiority of 
our proposed algorithms against current state-of-art, 
first with a straightforward yet representative example 
followed by applying to a real biological model. 

Results 

For a fair comparison across different algorithms, we 
need to define the performance measures of interest. In 
model checking, simulation runs are typically the most 
computationally expensive and obtaining accurate con- 
clusions about the model is of paramount importance. 
Therefore, the most desirable situation would be to 
obtain accurate conclusions of the model's behavior 
with the minimum number of simulation runs. As such, 
we use error rates and simulation runs (or samples) 
required of each algorithm as the basis for judging 
superiority in our comparison. 



Simple model 

Here, we use a simple uniform random generator that 
produces real numbers in the range of [0, 1] as our 
probabilistic simulation model. Suppose the property 
that we are testing is whether p > 9, and we fixed p (the 
true probability) to 0.3. To generate a sample, we use 
the uniform random generator to generate a random 
number and, the sample is treated as a true sample if 
and only if the generated value is lesser than p. We vary 
9 from [0.01, 0.99] (except p which is 0.3) with an inter- 
val of 0.01 and set 8 to be 0.05 and 0.025 for Figure 3a, b 
and 3c, d respectively. For each setting, the experiments 
are repeated 1000 times with a (Type-1 error rate) and /3 
(Type-2 error rate) of 0.01. We also limit the sample size 
for OSM B to be 3000. 

Figure 3 shows how critical and difficult the selection 
of 8 is for Younes A and Younes B. Too large, the error 
and undecided rates within the wide indifference region 
are unbounded and high (Figure 3a). On the other hand, 
if 8 is too small, then the number of samples required 
grows rapidly in the indifference region (Figure 3d). 

Indeed, if a suitable 8 can be chosen for Younes A and 
Younes B, the error rate is bounded and minimum sam- 
ples are used. However, it is a difficult task to choose an 
ideal 8 that balances the samples required and the error 
rates unless one has a good estimate of p (the true prob- 
ability), which is unrealistic. 

Furthermore, it should be noted that the Younes A 
algorithm does not provide information on whether the 
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Figure 3 Plots a & b are with an indifference region of 0.05 whereas c & d are with an indifference region of 0.025 for the small 
synthetic model. 



error rate is bounded or not, i.e., whether p is within or 
outside the indifference region. This implies that the 
user may come to a false conclusion that the result is 
bounded with a certain error rate when it is actually not 
(Figure 3a and 3c). 

While Younes B algorithm does indeed always bound 
the error rate when a definite result is given, it comes at 
the expense of a large number of undecided results 
when p is inside the indifference region. This means the 
algorithm uses up computational resources and, in the 
end, returns an undecided result, which is undesirable. 

Our proposed algorithm (OSM A) overcomes all these 
problems. First, the tough decision of choosing the 
indifference region is not required as the algorithm does 
do so dynamically and error rates are always bounded 
(Figure 3a and 3c). However, OSM A has a limitation in 
that it requires rapidly increasing number of samples as 
6 closes in on p (Figure 3b and 3d). 

OSM B removes this limitation by limiting the number 
of samples and ensures termination (Figure 3b and 3d). 
We should note that whenever OSM B returns a definite 



answer, the error is guaranteed to be bounded and, when 
the sample limit is reached, a confidence measure 
(p-value) is given. Therefore, it is clear to the user when a 
result is guaranteed to be error bounded and when it is 
not. 

Cell fate model of gustatory neurons with microRNAs 

Next, we perform model checking on the cell fate determi- 
nation model of gustatory neurons of Caenorhabdities ele- 
gans [18]. This model studies the regulatory aspects 
mediated by miRNA's on the ASE cell fate in C.elegans 
and focuses on a double negative feedback loop which 
determines the cell fate (Figure 4). A precursor cell state 
have equivalent potential to transit into the stable ASEL 
or ASER terminal state. While ASEL and ASER are physi- 
cally asymmetric, they are morphologically bilaterally sym- 
metric. It is believed that the cell fate (ASEL or ASER) is 
controlled by miRNA (lsy-6 and mir-273) in the double 
negative feedback loop. The computational model con- 
tains 22 entities (RNA or protein) and 27 processes (biolo- 
gical reactions). We first use a property from [7], where it 
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validates that the concentration of LSY-2 in the nucleus 
will never increase if it has risen and fallen once pre- 
viously, to illustrate the technical superiority of our pro- 
posed algorithms even in real biological examples. We will 
then discuss its practical implication in the next section. 

As before, we vary 6 from [0.01, 0.99] (except p which 
is estimated to be 0.25) with an interval of 0.01 and set 
8 to be 0.05 and 0.025 for Figure 5a, b and 5c, d respec- 
tively. For each setting, the experiments are repeated 
1000 times with fixed a (Type-1 error rate) and j3 
(Type-2 error rate) of 0.01. We again limit the sample 
size for OSM B to be 3000. Using a separate experiment 
of 10,000,000 simulation runs, we had estimated that 
the true probability, p, of this property to be 0.25 (in the 
10 million run results, approximately 25% of them satis- 
fied the property). 

Figure 5 again demonstrates the superiority of our 
proposed algorithms over current state-of-the-art. To 
give an even clearer picture of the advantages of our 
algorithm, we shall look at the cross-section of a few 
crucial data points (Table 1). 

Firstly, when 0 is distant from p, the problem is easy. 
Ideally, algorithms should use minimum amount of 
samples while maintaining the error bound. In Table 1, 
at 0 = 0.5, although all algorithms kept well within the 
error bounds but Younes A and B both requires much 
more samples than OSM A and B on average. 

As 6 approaches p, understandably more samples 
would be required to make an accurate conclusion. In 



these situations, the priority would typically still be to 
ensure error rates are under control while not using an 
exorbitant number of samples. Based on Table 1, at 9 = 
0.28, error rate of Younes A and B are dependent on 
the choice of d. If user is able to choose S to be 0.025, 
errors are low (Younes A made 2 errors while Younes B 
made no error) but if user made a wrong choice, d = 
0:05, it would be disastrous (Younes A made 54 errors 
while Younes B made 254 errors/undecided). Since S is 
not a parameter for OSM A and B, their performance 
are consistent, with error rates within 1% (or 10 errors 
in 1000 runs) and average sample size around 2000. 

In the event where 6 is extremely close to (or equal) p, 
it is hard (or impossible) to accurately decide unless we 
have huge (or infinite) samples. Therefore, one could 
only choose between high accuracy or minimum sam- 
ples. Our proposed algorithms are useful each in one 
situation. If high accuracy is desired by the user, OSM 
A is suitable. As shown in Table 1, 6 = 0.26, OSM A is 
constantly keeping errors close to 1% or 10 errors. If 
computation limitation is of concern to the user, OSM 
B could be used to maintain sample size limit. Younes 
A seems to perform better than OSM B, since it uses 
less samples and have relatively similar errors. However, 
it is important to note that this is actually not true 
because, when OSM B could not guarantee the error 
rate, it returns a p-value (107 errors are made by 
p-value), instead of the typical true or false conclusion, 
which would alert the user to be cautious. In contrasts, 
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Figure 5 Plots a & b are with indifference region of 0.05 whereas c & d are with indifference region of 0.025 for the cell fate model. 

As expected, this figure looks identical to Figure 3 as these algorithms do not make assumptions on the underlying stochastic model. 



Younes A does not have such differentiation and might 
mislead user to trust its decision. Furthermore, the 
value of each resulting p-value can be used as another 
red flag, as OSM B tends to be correct when the 
p-value is small and incorrect when the p-value is large 
(Figure 6). As for Younes B, it is even worse, it would 
run thousands of simulations and give undecided con- 
clusion, which is not very useful, up to 93.7% of the 
time. 

Practical implications 

In the previous few sections, we have shown the super- 
iority of our algorithms from a technical standpoint. In 
this section, we discuss the practical implications of our 
algorithms. In particular, we use model checking to ver- 
ify two behaviors of the ASE pathway model. 



Equivalent potential to transit into ASER or ASEL 

One important application of model checking is using it 
to ensure that the created simulation model exhibits 
behaviors that are widely accepted. In literature, it is sta- 
ted that a precursor ASE cell state should have equiva- 
lent potential to transit into stable ASER or ASEL. 
Therefore, we need to first ensure that the ASE pathway 
model created exhibits this behavior before we can 
deem the model to be correctly built and utilize it to 
perform any downstream analysis. 

Suppose we accept equivalent potential to be between 
45% and 55%, which means the simulation model should 
transit into ASER or ASEL with a probability of 0.45 to 
0.55. Translating this to model checking language would 
mean that ASER terminal cell fate markers (such as gcy5) 
should be more abundant than ASEL terminal cell fate 
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Cross-section of Figure 5 where 0 = (0.5, 0.28 and 0.26). At 6 = 0.26, total errors made by OSM B is 1 14 out of which 107 is due to p-value (sample limit reached). 
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Figure 6 P-value distribution of OSM B (from Figure 5) when 0 = 0.26. Average p-value for OSM B_Correct is 0.147 whereas average p- 
value for OSM BJncorrect is 0.357. 



markers (such as gcy6) after some simulation time with a 
probability of 0.45 to 0.55. More formally, the PLTLs for- 
mat would be P > o.45G([gcy5] > [gcy6]) {[time] > 300} 
AND P < o.5sG{[gcy5] > [gcy6]) {[time] > 300}. Readers 
unfamiliar with temporal logics and model checking in 
systems biology can find the relevant background materi- 
als in [19-21]. 

By using a separate, computationally expensive experi- 
ment of 10,000,000 simulation runs, we found that the 
ASE model transits into ASER and ASEL with approxi- 
mately 46% and 54% probability respectively. Therefore, 
the correct conclusion to be given by the algorithms 
should be to accept that the model as correct. 

Assuming that a user wants the error rate to be 
around 1% (a = 0.01 ji = 0.01), and has chosen S to be 
0.025. Table 2 shows that there is a 13.1% probability 
that Younes A incorrectly rejects the model while there 
is a 70.3% probability that Younes B replies with an 
undecided response. 

On the other hand, as shown in Table 2, OSM A only 
gives a wrong conclusion with 1.2% probability. How- 
ever, OSM A requires, on average, > 23,000 simulation 
runs to make a decision, which could be much more 
than the available computational resources to the user. 
In such cases, the user can still depend on OSM B 
where it needs only about 2,800 simulation runs on 
average, with only 1.2% probability of giving a wrong 
conclusion. The rest of the 11.4% wrong decisions given 



by OSM B is when computational resources are maxed 
out and OSM B returns a p-value instead of the true or 
false response. This should alert the user to be more 
cautious of the conclusion. 

lsy-2 in the nucleus will never increase if it has risen and 
fallen once previously 

Suppose that after validating the model, we are now 
interested in investigating whether the ASE model exhi- 
bits the following behavior: There is more than 28% 
probability that the concentration of lsy-2 in the nucleus 
will never increase if it has risen and fallen once pre- 
viously. Translating this to PLTLs would be P > o.2&((d 
([lsy2N ]) > 0) U(G{d{[lsy2N]) < 0))). 

Once again, by using a computationally expensive, 
separate experiment of 10,000,000 simulation runs, we 
have found that the model only exhibits this behavior 
approximately 25% of the time. Therefore, the correct 
conclusion to be drawn by the algorithms should be the 
model does not exhibit this behavior more than 28% of 
the time. 

Assume a user wants the error rate to be around 1% 
(a = 0.01 P = 0.01) and has chosen S to be 0.05. This time, 
there is a 5.4% probability that Younes A gives a wrong 
conclusion while there is a 25.4% probability that Younes 
B gives a wrong or undecided conclusion, whereas there is 
only a 0.5% probability that OSM A and OSM B make a 
wrong conclusion (Table 1). On the other hand, if the user 
had chosen a smaller 8 ( = 0.025), they would have been 
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1000 times. 



able to control the error rates (Table 1). Therefore, one 
naive strategy would be to always choose an extremely 
small S that is close to 0. However, since the expected 
number of samples of Younes A and Younes B are inver- 
sely proportional to S 2 [13], such a strategy would have 
required an exorbitant number of simulation runs. 

In the two scenarios above, we have chosen different 
values of S for Younes A and Younes B. Unfortunately, it 
was insufficient in both cases, causing Younes A and 
Younes B to not perform well (i.e. keeping error rates 
under control). This clearly shows that their success or fail- 
ure depends heavily upon the value of 8 and, in practice, it 
is unrealistic to expect users to be able to provide a suitable 
8 for every scenario. Therefore, eliminating the need for 
users to decide on the value 8, and dynamically selecting 
the optimal value depending on the situation, is a useful 
practical solution as proposed in OSM A and OSM B. 

Discussion 

In this paper, we have presented two algorithms (OSM A 
and OSM B) that are similar but serve different purposes. 
OSM A is recommended when computational resources 
are plentiful and/or bounding the error rates is a priority. 
In the situation where computational resources are lim- 
ited, OSM B is useful. While these algorithms are founded 
upon a simple idea, the improvements over current state- 
of-the-art algorithms are significant and practically useful. 
Firstly, our algorithms do not require the critical, yet diffi- 
cult to determine indifference region as an input para- 
meter. Secondly, our algorithms adjust automatically to 
the difficulty of the problem by dynamically halving the 
indifference region, leading to using fewer samples when p 
is far away from 6. Lastly, it always returns a definite 
response to the user, which is either guaranteed to be 
error bounded given sufficient samples or comes with a 
confidence measure if computational resources are 
limited. 

Therefore, we foresee the usage of these algorithms to 
be wide as there is no assumption or requirement of the 
simulation model, allowing them to be applied to any 
stochastic system analysis. 
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